Porting a Kalman filter based track fit to NVIDIA CUDA
Abstract
Finding particle trajectories is usually the most time-consuming part of event reconstruction in modern high energy physics experiments. In many present experiments with high track densities and complicated event topologies, a Kalman filter based track fit is already used in this combinatorial part of the event reconstruction. The speed of the track fitting algorithm is therefore critical for the total processing time. In 2007 a Kalman filter based track fitting algorithm of the CBM experiment [1] was ported to SSE and the Cell SPE [2]. On the CPU a speedup of 10000 compared to the initial version was achieved. Two major changes to the existing Kalman filter implementation made this speedup possible: the magnetic field map was replaced by a functional approximation based on fourth-order polynomials, and the algorithm was tuned to work entirely in single precision, without numerical instabilities and without the need for compensating computations.

Today CPUs are no longer able to increase their peak performance by raising clock speeds. Modern GPUs, however, continue to increase their peak performance with many-core architectures, with NVIDIA GT200 based GPUs reaching nearly 1 TFlop/s in single precision [3]. They are therefore a promising candidate for further acceleration of the algorithm. This performance comes at a cost. Unlike on CPUs, access to main memory is not cached, and all threads running on one multiprocessor, which is comparable to a CPU core, share a small 16 kB processor-local storage called shared memory that must be programmed explicitly. To compensate, each multiprocessor has a large register file of 16384 registers and can keep 1024 threads concurrently active. On its eight ALUs it processes warps of 32 threads in 4 cycles, switching between warps at no cost, which allows memory accesses to be hidden behind calculations. A GT200 chip contains 30 of these multiprocessors.
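To illustrate the field parameterization mentioned above, the sketch below evaluates a fourth-order polynomial approximation of one magnetic-field component in single precision. The function name, coefficient ordering, and coefficient values are hypothetical stand-ins, not taken from the CBM code; in practice the 15 coefficients would be fitted per detector station to the measured field map.

```cpp
#include <cmath>

// Hypothetical sketch: one magnetic-field component approximated inside a
// detector station by a fourth-order polynomial in the local coordinates
// (x, y).  All arithmetic is kept in single precision, as in the CBM port.
// There are 15 monomials x^i * y^j with i + j <= 4, hence 15 coefficients.
float fieldPolynomial4(const float c[15], float x, float y) {
    float result = 0.0f;
    int k = 0;
    for (int degree = 0; degree <= 4; ++degree) {
        for (int i = degree; i >= 0; --i) {
            int j = degree - i;
            result += c[k++] * std::pow(x, (float)i) * std::pow(y, (float)j);
        }
    }
    return result;
}
```

Replacing the tabulated field map by such a closed-form expression removes scattered memory lookups from the inner loop, which matters both for SIMD code and for a GPU, where uncached main-memory accesses are expensive.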
In contrast to single instruction multiple data (SIMD) architectures, NVIDIA GPUs are based on a single instruction multiple threads (SIMT) model, in which each ALU has its own instruction counter. The algorithm therefore needs to be parallelized at the thread level rather than at the data level as in the SIMD case. While in the SIMD case one thread works on a vector of numbers, in SIMT multiple threads work on scalars. In both cases, however, a number of arithmetic units is fed by the same instruction decoder, so it is important that the code does not take different branches depending on the data of each thread. The original port relied on operator overloading to keep a single implementation of the algorithm for multiple platforms.
Similar papers
ALICE TPC Online Tracking on GPU
For the ALICE High Level Trigger a fast tracking algorithm was developed by Sergey Gorbunov based on the Cellular Automaton method and the Kalman filter [1], that is currently installed in the HLT. For an efficient handling of upcoming lead-lead collisions in 2010 with a tremendous increase of clusters and tracks, possibilities for a better usage of parallelism and many core hardware were analy...
Porting NAHUJ to CUDA
This white-paper reports on an enabling effort that involves porting a legacy 2D fluid dynamics Fortran code to NVIDIA GPUs. Given the complexity of both code and underlying (custom) numerical method, the natural choice was to use NVIDIA CUDA C to achieve the best possible performance. We achieved over 4.5x speed-up on a single K20 compared to the original code executed on a dual-socket E5-2687W.
Fast Histograms using Adaptive CUDA Streams
Histograms are widely used in medical imaging, network intrusion detection, packet analysis and other streambased high throughput applications. However, while porting such software stacks to the GPU, the computation of the histogram is a typical bottleneck primarily due to the large impact on kernel speed by atomic operations. In this work, we propose a stream-based model implemented in CUDA, u...
An Incremental Approach to Porting Complex Scientific Applications to GPU/CUDA
This paper proposes and describes a methodology for porting complex scientific applications originally written in FORTRAN to NVIDIA CUDA. The process was developed and validated by porting an existing FORTRAN weather forecasting algorithm to a GPU parallel paradigm. We believe that the proposed porting methodology can be successfully utilized in several other existing sc...
Transformation of CPU-based Applications To Leverage on Graphics Processors using CUDA
Scientific computation requires a great amount of computing power especially in floating-point operation but a high-end multi-cores processor is currently limited in terms of floating point operation performance and parallelization. Recent technological advancement has made parallel computing technically and financially feasible using Compute Unified Device Architecture (CUDA) developed by NVID...